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Abstract 

Inferring a cause from its effect using observed time series data is a major challenge in natural 
and social sciences. Assuming the effect is generated by the cause trough a linear system, 
we propose a new approach based on the hypothesis that nature chooses the “cause” and the 
“mechanism that generates the effect from the cause” independent of each other. We therefore 
postulate that the power spectrum of the time series being the cause is uncorrelated with the 
square of the transfer function of the linear filter generating the effect. While most causal 
discovery methods for time series mainly rely on the noise, our method relies on asymmetries of 
the power spectral density properties that can be exploited even in the context of deterministic 
systems. We describe mathematical assumptions in a deterministic model under which the 
causal direction is identifiable with this approach. We also discuss the method’s performance 
under the additive noise model and its relationship to Granger causality. Experiments show 
encouraging results on synthetic as well as real-world data. Overall, this suggests that the 
postulate of Independence of Cause and Mechanism is a promising principle for causal inference 
on empirical time series. 


1 Introduction 

A major challenge in the study of complex natural systems is to infer the causal relationships 
between elementary characteristics of these systems. This provides key information to understand 
the underlying mechanisms at play and possibly allows to intervene on them to influence the overall 
behavior of the system. While causal knowledge is traditionally built by performing experiments, 
boiling down to modifying a carefully selected parameter of the system and analyzing the resulting 
changes, many natural systems do not allow such interventions without tremendous cost or com¬ 
plexity. For example, it is very difficult to influence the activity of a specific brain region without 
influencing other properties of the neural system [15] . Causal inference methods have been de¬ 
veloped to avoid such intervention and infer the causal relationships from observational data only 
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[23] ;18i. To be able to build such knowledge without interventions, these approaches have to rely 
on key assumptions pertaining to the mechanisms generating the observed data. 

The framework of causality in [23, |T8] has originally addressed this question by modelling 
observations as i.i.d. random variables. However, observed data from complex natural system are 
often not i.i.d. and time dependent information reflects key aspects of those systems. Most causal 
inference methods for time series, including the most widely used Granger causality [8j, assume 
the data is generated from a stochastic model through a structural equation linking past values to 
future ones through an i.i.d. additive noise term, the “innovation of the process” [SI 22]. While 
these methods can successfully estimate the causal relationships when empirical data is generated 
according to the model assumptions, the results can be misleading when the model is misspecified. 
In particular, this is the case when unknown time lags are introduced in the measured time series. 

In this paper, we introduce a new approach to inferring causal directions in time series, the 
Spectral Independence Criterion (SIC). The idea behind SIC, as well as several new approaches 
to causal inference 0 on uni ns, is to rely on the ‘philosophical’ principle that the cause and 
the mechanism that generates the cause from the effect are chosen independently by Nature. 
Thus, these two objects should not contain any information about each other [Ml HE]. Here, 
we refer to this abstract principle as the postulate of Independence of Cause and Mechanism 
(ICM). The above mentioned methods relying on ICM refer to different domains and rely on 
quite different formalizations of the concept of “independence”. SIC formalizes the ICM postulate 
in the context where both cause and effect are stationary time series and the cause generates 
the effect trough a linear time invariant filter. The SIC postulate assumes that the frequency 
spectrum of the cause does not correlate with the transfer function of the filter. This assumption 
is justified by its connection to the Trace Method [TO] and by a generative model of the system. 
Under this postulate, we prove that SIC can tell the causal direction of the system from its anti- 
causal counterpart. Moreover, we elaborate on the connection between this novel framework and 
linear Granger causality, showing they are exploiting fundamentally different information from the 
observed data. In addition, superiority to Granger causality is shown analytically in the context of 
time series measurements perturbed by an unknown time lag. We perform extensive experimental 
comparisons, both on simulated and real datasets. In particular, we show that our approach 
outperforms Granger causality to estimate the direction of causation between to structures of rat 
hippocampus using Local Field Potential (LFP) recordings. 

Overall, the proposed method offers a new approach to causal inference for time series data with 
identifiability results, and shows unprecedented robustness to measurement delays. The promising 
empirical results suggest the SIC postulate is a reasonable assumption for empirical data, and that 
it should be further exploited to develop novel causal inference techniques. 
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2 Spectral Independence Criterion (SIC) 


2.1 Notations and model description 


We refer to a sequence of real or complex numbers a = {at,t £ Z} as a deterministic time series. 
Its discrete Fourier transform is defined by 


a(v) = ^ at exp(—i27 rut), v £ [—1/2, 1/2] =: 1 


The energy of the deterministic time series is the squared l 2 norm: ||a||| = |cq| 2 . For ease of 

notation we will also use the Z-transform of a 


a( z ) = a t z *, z £ C 
tez 


such that a{y) = a(exp(i27r^)). 

We assume that the causal mechanism is given by a (deterministic) Linear Time Invariant (LTI) 
filter. That is, the causal mechanism is formalized by the convolution 
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where h denotes the impulse response, x the input time series and y the output. We will assume 
that the filter satisfies the Bounded Input Bounded Output (BIBO) stability property [20] . which 
boils down to the condition ||h||i < +oo. Under this assumption, the Fourier transform h is well 
defined and we call it the transfer function of the system. 

We assume that the input time series x is a sample drawn from a stochastic process, {X t , t £ Z}. 
For a given index t, X t represents the random variable at index t. We use {X t } or simply X to 
represent the complete stochastic process. We use X t:s to indicate the random vector corresponding 
to the restriction of the time series to the integer interval \t .. s]. We use X t:s to indicate the random 
vector corresponding to the restriction of the time series to the integer interval [t .. s]. Assuming 
X is a zero mean stationary process (in this paper, stationary will always stand for weakly or 
wide-sense stationary 0), we will denote by C xx (r) = E [XtXt +T \ the autocovariance function of 
the process and assume it is absolutely summable. Then, we can define its Power Spectral Density 
(PSD) S xx = C xx . Under these assumptions, the power of the process P(X) = E(|X t | 2 ) is finite 
and P(X) = f_ 1 / 2 S xx {v)dv, such that S xx belongs to L [ . Moreover, we recall the following basic 
properties for our model: 

Proposition 1. Assume the weakly stationary input X is filtered by the BIBO linear system of 
impulse response h to provide the output Y. Then < +oo, h £ L°° and Y is weakly stationary 
with summable autocovariance such that 


Syy{v) = \ti(l /)\ 2 S xx (is), V£Z 


( 2 ) 


Proof. Results from elementary properties of the Fourier transform and Proposition 3.1.2. in 


0 . 
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If such a linear filtering relationship exists for X as input and Y as output, but not in the 
opposite way, we can use this information to infer that X is causing Y and not the other way 
round. If there are such impulse responses exist for both directions, say /ix ->y and /iy->x, their 
Fourier transforms are related by 

^x—> y = —-, 

^Y-s-X 

and we have to resort to a more refined criterion for the causal inference. We will assume this 
situation in the remaining of the paper. 


2.2 Definition of SIC 

Assume we are given the two processes X := {X t ,t £ Z} and Y := {Y t ,t £ Z}. Moreover, we 
assume that exactly one of the following two alternatives is true: (1) X causes Y or (2) Y causes 
X. We assume that there are no unobserved common causes of X and Y. Our causal inference 
problem thus reduces to a binary decision. In the spirit of ICM, we assume that in case (1), X and 
h should not contain information about each other and our Spectral Indpendance Criterion (SIC) 
assumes that the input power does not correlate with the amplifying factor, that is, 

(S xx \h\ 2 ) = (S xx )(\h\ 2 ), (3) 

where (/) = J x f(i/)dn denotes the average over the unit frequency interval I. Note that the left 
hand side of (eq. (y)) is the average intensity of the output signal {Y t ,t £ Z} over all frequencies. 
Hence, SIC states that the average output intensity is the same as amplifying all frequencies by 
the average amplifying factor. To motivate why we call (eq. |3)) an independence condition we 
note that the difference between the left and the right hand side can be written as a covariance: 

(S xx ■ \h \ 2 ) - (S xx )(\h\ 2 ) = Cov (s xx , |£| 2 ) . 

were we consider S xx and \h\ 2 as functions of the random variable v uniformly distributed on I. 
As a consequence statistical independence between those random variables implies that (eq. (Q)) 
is satisfied. 

Note that the criterion (eq. 0)) can be rephrased in terms of the power spectra of X and Y 
alone using (eq. 13)); which are closer to observable quantities than h: 

Postulate 1 (Spectral Independence Criterion). If Y is generated from X by a linear deterministic 
translation invariant system then we have: 

(Syy) = (S xx )(S yy /S xx ). (4) 


2.3 Quantifying violation of SIC 

This motivates us to define a measure of dependence between the input PSD on one hand and 
transfer function of the mechanism on the other hand. To asses to what degree such a relation 
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holds we introduce a scale invariant expression px-s-Y, that we call the spectral dependency ratio 
(SDR) from X to Y: 

•= (Syy) ,r\ 

' ( S XX ){Syy/S XX ) 

Here, the value 1 means independence, which becomes more obvious by rewriting (eq. 0)) as 

Cov[S xx ,\h\ 2 ] 

(S XX )(Syy/S XX ) + ' 

Finally, we note that px->Y can be written in terms of total power and energy: 

P(J) 

P ^ Y P(X)||h||| 

We then define py->x by exchanging the roles of X and Y: 


, = (S xx ) 
PY ^ X ' (S yy ){S xx /S yy ) 


( 6 ) 


2.4 Identifiability results 


In order to identify the true causal direction from SIC, it is necessary to show that px->Y and 
Py-s-x take characteristic values that are informative about this inference problem. The following 
first result shows explicitly how dependence measures in both directions are related: 


Proposition 2. (Forward-backward inequality) For a given linear filter with input PSD S xx , 
output PSD S yy and a non-constant modulus transfer function h we have 


PX^Y-pY-fX < 1 ■ 

Moreover, if 3a > 0, Vi/ G X, \h(v)\ 2 < (2 — a)||/i|||, 
then 


Px^y-Py^x < 



2 

2 


dv 


< 1. 


(7) 


( 8 ) 


Proof of this proposition is given in supplementary material. Note that ||h ||2 corresponds to 
the mean value of the transfer function due to Parseval’s theorem. According to equation eq. (@) , 
the less constant \h\ 2 is, the more the product of the independence measures will be inferior to 1. 
Assuming the SIC postulate is satisfied in the forward direction such that px->Y = 1, it follows 
naturally that py->x < 1- The two causal directions can thus be distinguished well whenever the 
transfer function deviates significantly from its mean value such that px->yPy->x is bounded away 
from 1. We then infer the causal direction to be the one with the largest p value. 

To further support that SDR values can be used empirically for causal inference, we need 
the SIC postulate to be approximately satisfied (see (eq. ( 0 ))) in systems generated according 
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to the ICM principle. We now describe a model where h is generated by some random process, 
independently of X. To this end, assume we start with a Finite Impulse Response (FIR) h, that 
is, h T = 0 for all r > m, for some m. Then h is given by m real numbers b ±,..., b m such that 

hi = bi i = 0,..., m — 1 . 

We then apply an orthogonal transformation U, randomly drawn from the orthogonal group 0(m) 
according to the ‘uniform distribution’ on O(m), that is, the Haar measure. In this way, we 
generate a new impulse response function 

h'i ■.= (Ub)j i = 0,..., to - 1. (9) 

Since orthogonal transformations preserve the Euclidean norm by definition, they preserve the 
energy of the filter. Our procedure thus chooses a random filter among the set of filters having the 
same support of length m and the same energy. We now show that for large m the resulting filter 
will approximately satisfy SIC with high probability: 

Theorem 1. (concentration of measure for FIR filters) For some fixed S xx , let p l x ^y b e ^ le 
dependence measure obtained for h' in (eq. (0)^. If U is chosen from the Haar measure on 0(m), 
then for any given e 

2e 

\Px->y-M < pp^maxS xx (z/). 

with probability 5 := 1 — exp (k(to — l)e 2 ) where n is a positive global constant independent of m, 
e, X and Y. 

Proof of this theorem is provided in supplementary material. This result provides a justification 
for using SIC provided that the dimension of the vector of filter coefficients m is large enough. The 
relevance of m will be investigated in practice in the experimental section. 


2.5 Relation to the Trace Condition 

We now describe the relation between SIC and a causal inference tool called Trace Method ma- 
Let X and Y be n-dimensional variables, related by the linear structural equation 

Y = AX + E. 

where A is an m x n structure matrix and E is a n-dimensional noise variable independent of 
X. [TO] postulate the following independence condition between the covariance matrix of input 
distribution Y,x and A: 

Postulate 2 (Trace Condition). 

T m (AY x A T ) = T n (Y x )T n (A T A ), (10) 

approximately, where r n (B) denotes the renormalized trace tr (B)/n. 
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The postulate can be justified by random matrix theory with large m when A and Ej are 
independently chosen according to priors satisfying appropriate symmetry assumptions [lOj . In 
the association between SIC and trace method we only consider square matrices and therefore 
m = n. 

To quantify the violation of (eq. (fiol )l we introduce the following quantity: 


Defnition 1 (Tracial Dependency Ratio (TDR)). The tracial dependency ratio is given by 

T n (AE x A T ) 


TX—tY ■ = 


T n {Ti X )Tn{A T A) 


(ID 


We thus can see that the tracial ratio plays a role analog to our spectral dependency ratio p 
in the finite dimensional case. We can actually show that SIC can be viewed as a limit case of the 
Trace Condition by defining the following truncated system. 

Defnition 2. To any given infinite dimensional linear system I 4 Y = h * X, the truncated 
system of order N is defined by zeroing the input and the output values for integers k such that 
-N <k< N: 

X*n = X-n-.n-i l— t Y' n = (h* X' N )_ x . x _i, 


Note that in this definition for each N, the vectors Y , _ n . n _ 1 are inherently different. The 
mapping defined in this way is linear and can be written as Y 7 = HX' with [Hfij = hi-j , such 
that the trace method can be applied to it. We then have the following result showing that SIC 
can be obtained from the Trace Condition as an appropriate limit: 

Theorem 2. Let r X ' N ^Y' N represent the tracial ratio for the truncated systems of order N for a 
given linear system with SDR px->Y • Then 

lim r x > . y< = px-s-Y 

N-> OO « W 


The proof, together with two necessary lemmas is available in supplementary material. 


3 SIC for vector autoregressive models 

SIC and Granger causality rely on completely different assumptions but both apply to linear time 
series models. In this section, we study the classical Vector Autoregressive (VAR) model used in 
Granger causality from the SIC perspective to better understand the relation. 


3.1 VAR model 

We assume the observed time series are generated by a VAR model such that x Granger causes y. 


UkXt-k + £t 
k 

(12) 

^ ^ b k Yt-k H - ^ ^ CkXt—k “h £t 

k k 

(13) 
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Both noise terms e and £ in this expression are i.i.d normal noises. 


3.2 Applying SIC to VAR models 


We want to rewrite this expression such that Y is obtained from X by a deterministic linear time 
invariant filter. We observe that the VAR model can be cast as linear time invariant filter if we 
neglect the additive noise £. Indeed, then the mechanism is the following ARX (AutoRegressive 
with exogenous input) model R:. 

Y t = y, b k Y t~k + CkXt ~ k ( 14 ) 

k k 


Using basic properties of the Z-transform, we can derive the following analytic expressions of the 
input PSD S xx : 

S xx (v) = \n{v)\ 2 = |n(exp(27rii/))| 2 , 


with 


i(z) = 


1 - E fc a kZ 


-k ' 


Moreover, the transfer function corresponding to the mechanism in equation eq. (11411 is 


rh(z) = 


EfcCfc^- 


1 - Efc hz 


—k 


As a consequence, testing SIC on the VAR model in the forward direction amounts (when neglecting 
the filtered noise £), to test independence between 


\h{y)\ 2 = |m(exp(27riz/))|^ 


(15) 


and 


S xx {v) = |n(exp(27rk')|" 


(16) 


which are parametrized by the coefficients {bk,Ck} and {a*,} respectively. We conjecture that a 
concentration of measure result similar to Theorem [l| holds stating that independent choice of the 
coefficients from an appropriate symmetric distribution typically yields small correlations between 
(eq. (flil l and (eq. (16)). This will be tested empirically in the Experiments section. Additionally, 
the robustness of our approach to noise in the VAR model will be addressed extensively in a longer 
version of this manuscript. 


3.3 Comparison of SIC and Granger causality 

The bivariate VAR model above is the typical model where Granger causality works. To recall 
the idea of the latter, note that it infers that there is an influence from X to Y whenever pre¬ 
dicting Y from its past is improved by accounting for the past of X. Rephrasing this in terms of 
conditional independences, X is inferred to cause Y whenever Y t is not conditionally independent 






of X t _i,X t _ 2 , ■ • ■, given Y t _i,Y t _ 2 ,_ Within the context of the above linear model, knowing 

X t -i,X t - 2 , ■ ■ ■ reduces the variance of It, given Y t -±, Yt- 2 , ■ ■ ■ because then the noise u t is the only 
remaining source of uncertainty. Without knowing X t -±, X t - 2 , ■ ■ ■, we have additional uncertainty 
due to the contribution of et_i, et_ 2 ,.... 

SIC, on the other hand, does not rely on detecting whether X helps in improving the prediction 
of Y. As demonstrated above, SIC applied to a bivariate VAR model boils down to quantifying 
independence between two linear filters defined by set of coefficients, the filter generating the 
input with transfer function n and the filter of the mechanism with transfer function to. This 
is a completely different concept. One can easily imagine that the coefficients {bk,Ck} and {a*,} 
can be hand-designed such that the functions (eq. 0 )) and (eq. 0 )) are correlated. This would 
spoil SIC, but leave Granger unaffected. On the other hand, the following subsection describes a 
scenario where Granger fails but SIC still works. 


3.4 Sensitivity to Time Lag 

Consider two time series {X t } and {Y t } where {X t } is a white noise and 

Vt £ Z, Yt = cY t -i + Xt— 1 , 

for a given c. It can be easily seen that this type o f inp ut and output can be simulated using an 
HR filter with ( 01 , 02 ) = (l,c) and b\ = 1 in (eq. (1171 11 an d th e rest of the coefficients are zero 
(please refer to the definition of coefficients in section section 14.11 1. The infinite DAG for this causal 
structure can be seen in fig. [l|. 



Xt -1 X t Xt+i Xt +2 


Figure 1: The original causal structure with instantaneous causal effect 

Now if there would be a measurement delay of length k for Y, the observed values will be a 
new time series, say Y, where Y t = Y t -k- Although the ground truth is X —> Y independent 
of fc, Granger causality only infers the correct causal structure if k < 0 (where there is a lag in 
measurement of X, but not Y). However SIC always infers the correct direction (except when 
c = 0 and the time structure is spoiled). This is because the PSD of the white noise X is constant 
and depends only on the total power, i.e, 

S xx (v) = Var(A' t ) = P{X} , 

for all v £ [—1/2,1/2]. and obviously, this constant remains the same for the lagged time series. 
Thus, SIC correctly identifies the causal structure (except when c = 0 in which case the dependence 
to time is completely spoiled). 
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4 Experiments 


In this section we study our causal inference algorithm using synthetic experiments and apply it 
to several real world data sets. 

4.1 Synthetic Data: ARM A filters and processes 

We designed synthetic experiments to assess the validity of the SIC approach. The data generating 
process is as follows. The LTI system S modeling the mechanism is chosen among the family of 
ARMA(FO,BO) filters with parameters (a, b) defined by input-output difference equation: 

1 FO BO 

Vn — 1 ^ bi%n—i T ^ O'j '{jii—j ) ■ (17) 

a 0 n 

2=0 J=1 

For these filters FO is known as the feedforward order and BO is the feedback order. aA s and 
bAs are known as feedback and feedforward coefficients respectively. Note that when FO(S) = 0, 
the system is called and autoregressive filter. Alternatively, BO(S) = 0 corresponds to the family 
of Finite Impulse Response (FIR) or Moving Average filters. Whenever BO(S) ^ 0: the filter 
has Infinite Impulse Response (HR). The input of the causal model will be chosen among the 
family of ARMA(FO,BO) processes, which are generated by filtering an i.i.d noise input with an 
ARMA(FO,RO) filter. We thus chose two filters S and S', with parameters (a, b) and (a / ,b / ) 
respectively. To simulate a cause effect pair X,Y, we generated the cause X by applying S to a 
normally distributed i.i.d noise. Then, we generated Y by applying S' to X. The feedforward and 
feedback orders of both systems S and S' were chosen identical in all experiments. 

In each trial all the elements of vectors a, a', b and b' except the first ones (i.e. ao,b 0 , a' 0 ,b' Q 
which were fixed to one) were sampled from an isotropic multidimensional Gaussian distribution 
with variance 0.01. Coefficients are sampled using rejection sampling such that only BIBO-stable 
filters are kept. 

We simulated sequences of length 10000. The PSD of X and Y were estimated using Welch’s 
method [Uj. We repeated this experiment 1000 times. Figure fig. 0 shows an example of the 
distribution of px-»Y and py-»x and of their difference using FO(S) = BO(S) = FO(S') = 
BO(S') = 5. 

The SDR for the correct direction of is concentrated around one, while in the wrong direction 
the estimator stays less inferior to one for most of its probability mass (in this example %97.3). 
This results in a positive difference between SDR for most of the probability mass. Accordingly, 
our inference algorithm based on the sign of this difference algorithm [l| will select the correct 
direction in most of the cases. 

Based on this inference algorithm, we test the effect of the filter orders on the performance of 
the method, where we evaluate the performance of each setting of FO(S') and BO{S') over 1000 
trials. We varied the orders between 2 and 21 and compared the performance of the cases FO(S') = 
BO(S'), FO{S') = 0 and BO(S') = 0. Considering that the experiments are independent and 
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Figure 2: Top plot: Histogram for the estimators of px-yY and py^x- Bottom plot: Histogram 
of the estimated difference px^Y — Py^x 


based on the assumption that our method is successful with probability p where p has a binomial 
distribution, we calculated confidence intervals using Wilson’s score interval [25] where a = 0.05 
(and therefore z n / o = 1.96). The performance increases rapidly with filter order, as can be seen 
in the plots of fig. y. Moreover, the feedforward filter coefficients seem the most beneficial to the 
approach, since their absence leads to the worst performance ( fig. d red line). 

4.2 Real World Examples 

We tried our method over several examples of real data where the ground truth about the causal 
structure of the data is known a priori and the data is labeled in a way that the ground truth is 
X — > Y. In the first two examples we plotted the difference of SDR in both directions as a function 
of the window length used in Welch method which can be seen in fig. 0. 


4.2.1 Gas Furnace [4j 

This dataset consists in 296 time points, with X the gas rate consumed by a gas furnace and Y 
the produced rate of CO 2 . fig. li shows px-yY ~ Py->x against the window length, which was 
ranging from 50 to 150 points. As illustrated, the difference is always positive and our method 
is able to correctly infer the right causal direction independent of window length. TiMiNO and 
Granger causality correctly identified the ground truth in this case as well na¬ 


il 













Algorithm 1 SICJnference 
l: procedure SIC_Inference(X,Y) 

2: S xx t— spectrum of X 

3: Syy t— spectrum of Y 

4: Calculate px-s-Y and py->x using (eq. |3)) 

5: Inference Step: 

6: if px-> y > Py-s-x then 

7: return X —> Y 

8: else 

9: return Y —> X 
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Figure 3: SIC performance against filter order for synthetic experiments for different types of filters 
(see text). 

4.2.2 Old Faithful Geyser [2j 

N = 298 : X contains the duration of an eruption and Y is the time interval to the next eruption 
of the Old Faithful geyser. Figure fig. [1 represents the difference in SDRs as a function of win- 
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dow length with the same configuration as the gas furnace experiment. Again the correct causal 
direction is inferred by our method independently from the window length as illustrated in fig. 0. 
In this case TiMiNO correctly identifies the cause from effect but neither linear nor non-linear 
Granger causality infer the correct causal direction [Si- 



Figure 4: Difference between the estimators of SDRs in both directions against window length of 
the Welch periodogram. 


4.2.3 LFP recordings of the Rat Hippocampus 

It is known that contrary to neocortex where connectivity between areas is bidirectional, monosy¬ 
naptic connections between several regions of the hippocampus are mostly unidirectional |T]. An 
important example of such connectivity is between the CA3 and CA1 subfields [T]. Despite this 
anatomical fact, a study of causality based on Local Field Potential (LFP) recordings of CAl 
and CA3 of the hippocampus of the rat during sleep reports that Granger causality infers strong 
bidirectional relations between the two areas 0- 0 explains the possible reasons of such result 
as feedback loops involving cortex and medial septum, and diffuse connections going from CAl to 
CA3. 

To do a comparison with Granger causality, we applied our framework to recordings from those 
regions using a publicly available datase10 [HI [17]. LFP’s were recorded using a 8 shank probe 
having 64 channels downsampled to 1252Hz. Shanks were attributed by experimentalists to the 
CAl and CA3 areas (leaving 32 channels for each area). For more information on the details of 
gathered data please refer to m- We used the data for rat “vvpOl” during a period of sleep and 
a period of active behaviour in a linear environment. We applied linear Granger causality using 

1 http: / / crcns.org/data-sets/hc 
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an implementation from the statsmodel Python librarjQ We considered a forced decision scheme 
for Granger causality (to make it comparable to our method), were we select the correct Granger 
causal direction as the one having the lowest p-value for the null hypothesis of absence of causal 
influence. Following the usual methodology of causality analysis we divided the duration of 
ten minutes into 300 intervals of two seconds (N = 2504) to reduce the effect of nonstationarity in 
data analysis, and performed SIC causal inference on each interval for each electrode pair. We took 
two different approaches to report assess the performance of methods: one, based on a majority 
vote over all 300 intervals for each channel pair, and two, by assessing the average performance 
based on individual time intervals. The results are plotted as histograms in fig. |5| and they show 
that SIC clearly outperforms Granger causality on this dataset. The confidence intervals are once 
again based on Wilson score but obviously this time the in dependancy assumption between the 
trials is not well justified, specially for pooling all the results. 



Figure 5: Average performance of the linear Granger causality and SIC methods for deciding 
CA3—>CA1 ground truth direction against the opposite. For both the linear and sleep sessions the 
performance is significantly above the chance level for SIC. * indicates the use of a majority voting 
scheme. 


2 Statsmodels: Statistical library for Python More details on null hypothesis for Granger causality can be found 
on the website. 
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4.2.4 Characterizing the Echo 


The echo effect of a room over a sound generated in the room can be well estimated by a convolution 
of the real signal with a function known as room impulse response function. In this experiment we 
used an open source database of room Impulse Response Function (IRF) available at the Open AIR 
librarjjfl We chose the IRFs for Elevden Hall, Elevden, Suffolk, England and Hamilton Mausoleum, 
Hamilton, Scotland. We convolved these signals with 30 ±5 seconds segments of two classical music 
pieces: the first movement of Vivaldi’s Winter Concerto consisting of 9190656 data points, and 
the Lacrimosa of Mozart’s Requiem, consisting of 8842752 points, both ‘.wav’ files with the rate of 
44100Hz. Regardless of the segment the SDR in forward direction is considerably larger than the 
SDR in the backward direction as can be seen in fig. 0. In another experiment we used a computer 


— Cloackroom — Lord 

— MarbleHall 

Cloackroom — Visitors 

SmokingRoom 



2 3 4 5 6 7 

Audio segment 



123456789 

Audio segment 


Figure 6: The plots represent the value px t ^Y t — py t ^x t for 4 different environments as a function 
of different music segments. The method correctly infers the causal direction in all the cases. 

to play the musical pieces above in an academic Lecture Hall (labelled as “Hall” in plots) and in 
an office room (labelled as “Room” in plots) and recorded the echoed version in the environment. 
In a series of different tests, we split the data into 9,17, 33,65,129 pieces, and we ignored the last 
piece so that all the pieces would have an equal length. In each test we averaged the performance 
of our causal inference method over all the segments and plotted this performance against the size 
of the window length in Welch method. The window size was varied between 500 and half of the 
length of the music segment length (which is dependent on the number of segments). The results 
can be found in fig. @ and show a very good performance of the approach for large window lengths 


3 Open AIR: Open source library for acoustic IRFs. 
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Figure 7: The performance of the method over real echoed audio signals recorded simultaneously 
by playing the piece in two different closed environments that have their own acoustic structure. 


5 Conclusion 

We have introduced a causal discovery method for time series based on the SIC postulate, assuming 
a LTI relationship for a given pair of time series X and Y, such that either X —>■ Y or Y — > X. 
Theoretical justifications are provided for this postulate to lead to identifiability. Interestingly, the 
method provides and extension of the recently proposed Trace Method approach to the time series 
setting. Encouraging experimental results have been also presented on real world and synthetic 
data. Specially this method proved to be more effective than linear Granger causality on LFP 
recordings from CAl and CA3 hippocampal areas of rat’s brain, assuming a ground truth causal 
direction from CA3 to CAl based on anatomy. We suggest that this method can provide a new 
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perspective for causal inference in time series based on assumptions fundamentally different from 
Granger causality. We will address the existence of confounders, establish a statistical significance 
test (for example using a procedure inspired by |26'j). and extend this method to multivariate time 
series in future work. 
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Supplementary Material 


We have prepared an appendix to address the proofs for Proposition theorems [l| and [3 which 
we provide in two different sections. For this purpose we will use a few extra notations which we 
define here. We will also use t(A) and tn(A) interchangeably for the normalized trace of a square 
matrix A of order N. 


6 Proof of Proposition \2 


Lemma 1. For f £ L 2 (I) positive, non-constant, such that 1// £ L 2 (I), we have 


L mJx -J I 7W dx>1 


Proof. Using Cauchy-Schwartz inequality for the scalar product 

(fix), = [ f(x).-^—dx = 1. 

fix) JI /( X) 

Inequality is strict since / and 1/f are not collinear (otherwise / would be constant). □ 

Lemma 2. Let f £ L 1 (I) be positive, non-constant, such that 1/f £ L 1 (X) and f x f(x)dx = 1. 
Assume 3a > 0, \/x £ I, f{x) <2 — a , 
then 

[ f{x)dx. [ -j—dx > 1 +a [ (f(x) - 1 ) 2 dx 
Ji Ji fix) Ji 

Proof. We denote s{x) = f{x) — 1. Then f x s(x)dx = 0 and 


U {x)dx Lw) 


dx — 1 = 


-■s(aQ 

1 + s(x) 


dx 


For x > — 1, we have 


—x 
1 + X 


> X — X — X. 


(18) 


The function on the l.h.s. of (eq. (1181 ) 1 is convex because its second order derivative is 

positive and using its tangent in x = 0, we get 

[ f(x)dx. [ dx — 1 > f s(x) 2 (l — s(x))dx 

JI Ji fix) Jx 

Since 1 — s(x) = 2 — f(x) > a > 0, 


f{x)dx. [ dx — 1 > a / s(x) 2 dx 

:' Jx fix) Jx 


□ 
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Proof of Proposition 0. By using the definition of Spectral Dependency Ratios and Lemma 0 we 
get 


Px-j-yPy-s-x = 


< 1 


(W 2 XVN 2 > 

Moreover, applying Lemma 0 to / = \h \ 2 / f x \h \ 2 = |^| 2 /||h ||2 we get inequality eq. fl8|). □ 


7 Proof of Theorem 


To prove this theorem we rely on a theorem from |10] and a corollary that we derive from it. 

Theorem 3 (concentration of measure for finite dimensional linear relationships). 1 1 Of Suppose 
Y> is a given covariance matrix and suppose A £ M nxm (WL) is also a given matrix. Then if one 
generates E^ = LTEU t by uniformly choosing an orthogonal matrix U from 0(n) then E^ together 
with A, satisfies trace condition in probability when n tends to infinity. More precisely for a given 
e there exist S := 1 — exp(«(?z — 1)£ 2 ), k being a constant where 

\T m (AV x A T ) - r n (S A -)T m (Ad T )| = \T m (AUZU T A T ) - r„(E)r m (^ T )| < 2 £ ||E||PT t || 

holds with probability 6. 

In the above theorem (and the rest of the document), ||.|| applied to a matrix will refer to the 
operator norm. The following corollary is a direct consequence of the previous theorem: 

Corollary 1. Suppose E is a given covariance matrix and suppose A £ M„ xm (R) is also a given 
matrix. Then if one generates Ajj = AU by uniformly choosing an orthogonal matrix U from 0(n ) 
then Ajj together with E, satisfies trace condition in probability when n tends to infinity More 
precisely for a given e there exist 6 := 1 — exp(«:(n — 1)£ 2 ), k being a constant where 

\ T m{AjjT,Ajj) — T n (T,x)T m (AA T )\ = 

\T m (AUZU T A T )-T n (Z)T m (AA T )\ < 2e||E||||TA t || 


holds with probability S. 

To prove the main theorem we will also need two lemmas that are stated below. 

Lemma 3. t22ty For a given Hermitian matrix H and any principal submatrix of H, H', their 
spectral radius p s satisfies 


Ps{H) > p s (H'). 

Lemma 4. 0/ Let f : [— ^) —> R / £ L 1 be a bounded function and suppose tj. is its Fourier 

series coefficients, i.e. 

1 

t k = ^ f(v)e i2nku dv, t£Z. 
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Consider Toeplitz matrices T n defined as 

[-^ri]yj — b i—j b j £ {Oy kl 1} 

with eigenvalues T nj k (0 < k < n — 1). Then if ti are absolutely summable we get: 

min f(x) < T nt i < max f(x) 

Proof of Theorem fl Without loss of generality and for the sake of simplicity we only consider the 
positive indices of the time series and we take the filter to be causal; other cases can be treated in 
a similar way. Then the following relation holds between input and output of the filter: 

m— 1 

Vi, 0 < i < N - 1 Yi = b 3 X i-j 

j=o 

Formulated in terms of matrices the above relation can be represented as 


Yo 


A-m+l 

Vi 

= B 

A-m+2 

Pv_ 2 


X N- 2 

Pv -1 


X N-1 


where B is a iV x (N + m — 1) matrix as follows: 


bm—l bm—2 

0 bm—l 


bo 

bi 


0 

bo 


bm—l * ' ' 

0 bm—l 


0 0 
0 0 

b 0 0 
bi b 0 


We define £V €E 


Vi 


to be the covariance matrices as follows: 

0 < i < N — 1 0 < j, k < m — 1 [ X ’x}jk = 

dov(A^_|_j, A,4-) 


Since the time series that we are dealing with are weakly stationary it is obvious that E^- is 
independent of i. If we take Ex 0;f ,_i, Ey 0;JV _i G Mn x n(M) to be the covariance matrices for 
Xo:at_i and Yo:iV-i respectively, then we have 


j Yq:N - 


= BY> 


X-m+l:N-l 


B l 


Also define E^ 0 . JV _ 1 to be the covariance matrix of the output for FIR S' with b' = U T h. Also 
assume the spectrum of the output for this filter is S yy . One can write diagonal elements of Eyo.^., 
and E y 0 . N _ i based on the above equation as follows: 




= b ' S^b, 


P 


u 1 
r 0 : jy_iJ 


= b l UE i x U T b 
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and therefore the normalized traces of Ey- 0;JV _ 1 and £y- 0 , N _ 1 can be written as 


JV-l 


tn{ 


= ^b T E S( Y b, 


2 — 0 
N—l 


tn{ 


( S 5L_J = ^ T U E S -^ T b 


2=0 


Define £ := Etc' &x = ^x- Taking A = b T in corollary corollary 0 for a randomly selected U 
we get 


and therefore 


|-b T £/£t/ T b- -r m (E)(b,b)| < 2e||E|| N /(Vb> 


l^(^ 0;JV _ 1 )--r ro (£)||b||^|<2 e ||£||||b||| 


(19) 


with probability S. On the other hand the elements of diagonals of £ Y -’s are C'x(Q). Therefore: 

^-< s > = = p < x > 

Since £ Y ’s are principal submatrices of T,x 0 . N _ 1 therefore by corollary lemma 0 

1 JV-l 1 N -1 

||£|| = p(£) = ||- E Evil <^E H S *H < p(E 

2=0 2=0 

Because Cx(t)’s are absolutely summable we apply lemma lemma Q and we get 

P(E*o:jv-i) < m ax $XX (y) 5 

V 

such that inequality eq. Ilf)! ) can be rewritten 


J P(X)||b||| 


- 1 | < 2 


p{xy 


which completes the proof. 


□ 


8 Proof of Theorem 2 


In this section we give a proof that the TDR (see eq. eq. (Illh l asymptotically approaches the 
SDR (see eq. eq. (|5|)). We first state and prove two lemmas that are used to derive this result. As 
before suppose {X t } and {Y t } are given input and output of an LTI filter that are related through 
the impulse response function {h t }. According to the definition of the truncated linear systems 
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(see definition definition |2|) of order N for the linear system above we get the following matrix 
relationship: 


Y r 

I -N 


7 

o 

i_ 

h~ 2 N+i 


X-N 

Y' 

1 -N +1 


5- 

. 5“ 

o 

h- 2 N+ 2 


X-N+l 

Y> 

1 N-2 

Y' 

1 N-l J 


h 2 N — 2 h 2 N-3 • 

h 2 N-i h 2 N- 2 • 

1 

7 ° 

J - 


Xn -2 

X N -i 


If we name the vector on the left as y n, the matrix as H N and the right vector as x^r then the 
associated TDR yields: 


r x N ^y N 


_ t n{^ y N ) _ 

t n (Z xn )t 2N (H»H" T ) 


( 20 ) 


Define Tjv := t 2 n(H n H nJ ). Now we show that Tjv converges to ||h|j| the energy of the impulse 
response. 

Lemma 5. Assume ||/i||§ < +oo, then 


lim T N = \\h \\ 2 2 

N —>-+oo 


Proof. First lets simplify the expression for T/y: 


2JV-1 


= if I' 1 *' 22 " ^ 

i,j k=—2N+l 


2N 


ii, 12 27V |fc| ^ , 2 2N — \k\ 

Y IM 9N + IM 


2N 


k=—2N+l k=0 

It is easy to see that T/v is an increasing sequence of N. Moreover it is bounded by 


y \h k \ 2 < oo. 


( 21 ) 


Therefore this series converges. In order to show that it converges to |jh|| 2 , we first notice that for 
a given e, there exist mo € N such that 

m 

Mm > mo | \h k \ 2 - ||h|| 2 | < e. (22) 

k=—m 


Now take N mo > ^ \ h ™o\ . w e h ave 




mo2 m ° +1 \hm 0 \ 2 

£ 


\hm 0 \ 2 m 0 £ 

2N mo < 2™o+2 • 
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Same can be done for any 0 < k < mo, i.e. there exist N k such that: 

\h k \ 2 k ^ e 
2N k < 2 k + 2 

Now take -/V max = max{AT 0 , Ni, ..., N mo } + 1. Then obviously we get: 

|2f 


|hfc| (2iV max - k) 


And therefore: 


E 

k—0 


\h k \ 2 - 


2N n 


\h k \ 2 (2N max - k ) 


< 


2 k +2 


2N,r 


< 


E 

0 


£ £ 
2 k + 2 < 2 ' 


(23) 


Similar results hold for the first sum term in (eq. (12111 1 and by taking the maximum of two JV max ’s 
(say Nina.*) an d considering the fact that TV is increasing and by the application of triangular 
inequality for (eq. (12211 1. we can easily infer that 




Tn - ||h|| 


< e. 


□ 


In order to get the main result, we also need to prove that Y^s in (section^) are asymptotically 
converging to Y k s in the following sense: 

Lemma 6. Suppose an LTI filter S with zero mean weakly stationary processes as input ({X t }) 
and output ({Y t }) and impulse response function {h t } has been given. Then for the truncated linear 
systems we have: 


Jin^ ~ t(X YLn:N _J\ = 0, 


Proof. For simplicity of calculations we name 2N dimensional random vectors Y'_ N . N _\ and 
y_ jV :iV-i as Y' and Y and their covariance matrices with Yy and S y respectively. Then we 
have: 


r(£r —= t(E(YY t )) - t(E(Y'Y' )) 


2N 


E(y y) — E(y' y 7 ) = 


2N 


E((y-y , ) T (y+ y'))| < — E|(y-y') T (y + y')| < 


-E 


2 N 


(^/(y-y') T (y-yo x ^/(y + y') T (y + y')) 


< 


2^V E (( y “ y ' )T(F - ^ X + F ')) = 

\I 2 lv mY ~ Y ' V{Y ~ Y ' )] X \/Jn E({Y + Y ' V{Y + y,)) = y/T^Y-Y'Wr^Y+Y') 


where (*) and (**) follows from the fact that one can take trace (or normalized trace) into ex¬ 
pectation and vice versa, and moreover from the fact that tr (AB) = tr (BA) for any two matrices 
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that their multiplication is well defined. The inequalities are the result of the application of 
Cauchy-Schwartz inequality for covariances of random variables. First we show that \J r(Ey+y/) 
is bounded as a function of N. Define { h ^ 11 } as follows 

h U) = ! 2ht ^ —N<t+j<N—1 
| ht otherwise 

We can bound each element of diagonal of T,y+y' as follows 

oo oo 

Py+y,]^ =E[(Yj +Y') 2 ] = e[( E x i-ih\ j) f] <e[( E I^mII&W] < 

l ——oo l ——oo 

oo 

4E[( E l^-HI^I) 2 ] =4Cy(0), 

l ——oo 

and therefore r(Ey + y<) is bounded. 


Now we show that each element of diagonal of Ey_y< tends to zero when N tends to infinity 
which will complete the proof. With overload of notation, in this case define {h[^} as follows 


h[ j) = 


0 if - N<t+j<N-l 
ht. otherwise. 


Then for the j-th element of diagonal of Ey_yi we have 


P Y-Y'ki = e[(YJ - Y') 2 ] = E[( E Xi-ih?) 2 ] = e[( E x 3-i h i? 

l=— oo 1>N—j 

l<-N-j 

Since autocorrelation function attains its maximum at t = 0 and 


Vi, j G Z, E (XtXj) < y E(X?)E(X?) 

we get: 

Vi, j G Z, E(XjXj) < E(-Xq). 

As a result we have: 

[Ey—y'],j= E[( E E E ( x o)hihu = E(X 0 2 ) E < 

l>N-j l,l’>N-j l,l'>N-j 

K-N-j i^i' 

E(X 2 )( E M 2 <e(*o 2 )( E n ) 2 

l>N—j l>N-j 

K-N-j l<-N-j 

Now since {h t } is absolutely convergent, it follows that [Ey_y'], 3 can be arbitrarily reduced by 
increasing N. Then it follows that t(Ev-Y') approaches to zero when N tends to infinity. □ 


25 



Finally to complete the proof of the theorem regarding the asymptotic behaviour of trace 
condition in the truncated linear systems and the equivalence of trace condition (see postulate 
postulate [1) to SIC, we need one of the convergence theorems due to Szego: 

Theorem 4 (Szego’s convergence theorem). Let f : [— i) —>• K. / £ L 1 be a bounded function 
and suppose tf. ’s are its Fourier series coefficients, i. e. 


tk — 


f(v)e 


i2nkiy 


dv, t G Z. 


Consider Toeplitz matrices T n defined as 


b j £ {0, ...,Tl 1} 

with eigenvalues T n> k {0 < k < n — 1). Then if T n ’s are Hermitian, i.e. U = t) for any i, then for 
any continuous function F we have: 


^ n— 1 

lim - V F{r n k) = 

n—foo Ti L — 
k =0 


F{f{v))dv 


We are ready to state our convergence theorem: 


Theorem 5. For a given truncated linear time series, rx' N ^-Y' N asymptotically approaches to the 
spectral values of time series on infinite domain. As a result the spectral density based estimator 
coincides with the trace based estimator in the limit, and more precisely 


2 2 

lim t(£ X n )= / S xx {v)dv, lim r(E yN ) = / S yy (v)dv 
N—voo J N—*oo J 


and lim T N = / \h{v)\ 2 dv, 
N—foo I 


where T^ is defined as in (eq. (|2lh ). And eventually: 


Ihn r X ' N ^Y' N = Pyl->y 


lim r Y >^x> = Py~>yl 

n—foo N N 


Proof. Both E Xn and E yN are hermitian Toeplitz matrices and based on theorem theorem 0 where 
F has been chosen as identity function and also applying lemma lemma @ we get: 


2 


lim t(E X n ) 

-f 

S X xiy)dv 

(24) 

N—foo 

J 

1 

2 

lim r(E yN ) 

1 

2 

-I 

Syy{v)dv 

(25) 


1 

2 
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Moreover by Plancherel’s theorem and lemma lemma |5| it follows that: 


1 

2 

lim Tn = ||h||| = [ \h{v)\ 2 dv (26) 

>oo J 

_ 3 . 

2 


□ 

This theorem therefore shows that the trace ratios calculated for windowed version of time 
series are nothing but estimates of the spectral ratios and therefore justifies that these two different 
methods for causal inference are indeed consistent with each other. 
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